Abstract: When testing a null hypothesis $H_0: \theta = \theta_0$ in a Bayesian framework, the Savage-Dickey 
ratio (Dickey, 1971) is known as a specific representation of the Bayes factor (O'Hagan and Forster, 
2004) that only uses the posterior distribution under the alternative hypothesis at $\theta_0$, thus allowing 
for a plug-in version of this quantity. We demonstrate here that the Savage-Dickey representation 
is in fact a generic representation of the Bayes factor and that it fundamentally relies on specific 
measure-theoretic versions of the densities involved in the ratio, instead of being a special identity 
imposing some mathematically void constraints on the prior distributions. We completely clarify the 
measure-theoretic foundations of the Savage-Dickey representation as well as of the later generalisation 
of Verdinelli and Wasserman (1995). We furthermore provide a general framework that produces a 
converging approximation of the Bayes factor that is unrelated to the approach of Verdinelli and 
Wasserman (1995) and propose a comparison of this new approximation with their version, as well as 
with bridge sampling and Chib's approaches. 

Keywords and phrases: Bayesian model choice, Bayes factor, bridge sampling, conditional distribution, 
hypothesis testing, Savage-Dickey ratio, zero measure set. 



1. Introduction 

From a methodological viewpoint, testing a null hypothesis $H_0: x \sim f_0(x|\omega_0)$ versus the alternative 
$H_a: x \sim f_1(x|\omega_1)$ in a Bayesian framework requires the introduction of two prior distributions, 
$\pi_0(\omega_0)$ and $\pi_1(\omega_1)$, that are defined on the respective parameter spaces. In functional terms, 
the core object of the Bayesian approach to testing and model choice, the Bayes factor (Jeffreys, 1939, 
Robert, 2001, O'Hagan and Forster, 2004), is indeed a ratio of two marginal densities taken at the same 
observation $x$, 

$$B_{01}(x) = \frac{\int \pi_0(\omega_0)\,f_0(x|\omega_0)\,d\omega_0}{\int \pi_1(\omega_1)\,f_1(x|\omega_1)\,d\omega_1} = \frac{m_0(x)}{m_1(x)}\,.$$

(This quantity $B_{01}(x)$ is then compared to 1 in order to decide about the strength of the support of the 
data in favour of $H_0$ or $H_a$.) It is thus mathematically clearly and uniquely defined, provided both integrals 
exist and differ from both 0 and $\infty$. The practical computation of the Bayes factor has generated a large 
literature on approximation methods (see, e.g., Chib, 1995, Gelman and Meng, 1998, Chen et al., 2000, Chopin 
and Robert, 2010), seeking improvements in numerical precision. 

The Savage-Dickey (Dickey, 1971) representation of the Bayes factor is primarily known as a special 
identity that relates the Bayes factor to the posterior distribution which corresponds to the more complex 
hypothesis. As described in Verdinelli and Wasserman (1995) and Chen et al. (2000, pages 164-165), this 
representation has practical implications as a basis for simulation methods. However, as stressed in Dickey 
(1971) and O'Hagan and Forster (2004), the foundation of the Savage-Dickey representation is clearly theoretical. 

More specifically, when considering a testing problem with an embedded model, $H_0: \theta = \theta_0$, and a 
nuisance parameter $\psi$, i.e. when $\omega_1$ can be decomposed as $\omega_1 = (\theta, \psi)$ and when 
$\omega_0 = (\theta_0, \psi)$, for a sampling distribution $f(x|\theta,\psi)$, the plug-in representation 

$$B_{01}(x) = \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\,, \tag{1}$$

with the obvious notations for the marginal distributions 

$$\pi_1(\theta) = \int \pi_1(\theta,\psi)\,d\psi \quad\text{and}\quad \pi_1(\theta|x) = \int \pi_1(\theta,\psi|x)\,d\psi\,,$$

holds under Dickey's (1971) assumption that the conditional prior density of $\psi$ under the alternative model, 
given $\theta = \theta_0$, $\pi_1(\psi|\theta_0)$, is equal to the prior density under the null hypothesis, $\pi_0(\psi)$, 

$$\pi_1(\psi|\theta_0) = \pi_0(\psi)\,. \tag{2}$$

Therefore, Dickey's (1971) identity (1) reduces the Bayes factor to the ratio of the posterior over the prior 
marginal densities of $\theta$ under the alternative model, taken at the tested value $\theta_0$. The Bayes factor 
is thus expressed as an amount of information brought by the data and this helps in its justification as a model 
choice tool. (See also Consonni and Veronese, 2008.) 

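In order to illustrate the Savage-Dickey representation, consider the artificial example of computing the 
Bayes factor between the models 

$$\mathfrak{M}_0: x|\psi \sim \mathcal{N}(\psi, 1), \quad \psi \sim \mathcal{N}(0, 1),$$

and 

$$\mathfrak{M}_1: x|\theta,\psi \sim \mathcal{N}(\psi, \theta), \quad \psi|\theta \sim \mathcal{N}(0, \theta), \quad \theta \sim \mathcal{IG}(1, 1),$$

which is equivalent to testing the null hypothesis $H_0: \theta = \theta_0 = 1$ against the alternative 
$H_1: \theta \neq 1$ when $x|\theta,\psi \sim \mathcal{N}(\psi, \theta)$. In that case, model $\mathfrak{M}_0$ 
clearly is embedded in model $\mathfrak{M}_1$. We have 

$$m_0(x) = \exp(-x^2/4)\big/(\sqrt{2}\sqrt{2\pi}) \quad\text{and}\quad m_1(x) = (1 + x^2/4)^{-3/2}\,\Gamma(3/2)\big/(\sqrt{2}\sqrt{2\pi})\,,$$

and therefore 

$$B_{01}(x) = \Gamma(3/2)^{-1}\,(1 + x^2/4)^{3/2}\exp(-x^2/4)\,.$$

Dickey's assumption (2) on the prior densities is satisfied, since 

$$\pi_1(\psi|\theta_0) = \frac{1}{\sqrt{2\pi}}\exp(-\psi^2/2) = \pi_0(\psi)\,.$$

Therefore, since 

$$\pi_1(\theta) = \theta^{-2}\exp(-\theta^{-1})\,, \quad \pi_1(\theta_0) = \exp(-1)\,,$$

and 

$$\pi_1(\theta|x) = \Gamma(3/2)^{-1}\,(1 + x^2/4)^{3/2}\,\theta^{-5/2}\exp\left(-\theta^{-1}(1 + x^2/4)\right)\mathbb{I}_{\theta>0}\,,$$

$$\pi_1(\theta_0|x) = \Gamma(3/2)^{-1}\,(1 + x^2/4)^{3/2}\exp\left(-(1 + x^2/4)\right)\,,$$

we clearly recover the Savage-Dickey representation 

$$B_{01}(x) = \Gamma(3/2)^{-1}\,(1 + x^2/4)^{3/2}\exp(-x^2/4) = \pi_1(\theta_0|x)\big/\pi_1(\theta_0)\,.$$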
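
The identity can also be checked numerically. The short Python sketch below is our own illustration, not the authors' R code, and all function names are hypothetical. It evaluates the closed-form Bayes factor $m_0(x)/m_1(x)$ and the ratio $\pi_1(\theta_0|x)/\pi_1(\theta_0)$ at $\theta_0 = 1$, using the standard continuous versions of the densities given above.

```python
import math

def bayes_factor_closed_form(x):
    """B_01(x) computed as m_0(x) / m_1(x) from the closed-form marginals."""
    norm = math.sqrt(2) * math.sqrt(2 * math.pi)
    m0 = math.exp(-x**2 / 4) / norm
    m1 = (1 + x**2 / 4) ** (-1.5) * math.gamma(1.5) / norm
    return m0 / m1

def prior_theta(theta):
    """IG(1, 1) prior density pi_1(theta), continuous version."""
    return theta ** (-2) * math.exp(-1 / theta) if theta > 0 else 0.0

def posterior_theta(theta, x):
    """Marginal posterior pi_1(theta | x), continuous version."""
    if theta <= 0:
        return 0.0
    b = 1 + x**2 / 4
    return b**1.5 / math.gamma(1.5) * theta ** (-2.5) * math.exp(-b / theta)

def savage_dickey_ratio(x, theta0=1.0):
    """Posterior over prior marginal density of theta at the tested value."""
    return posterior_theta(theta0, x) / prior_theta(theta0)

for x in (0.0, 0.5, 2.0):
    print(x, bayes_factor_closed_form(x), savage_dickey_ratio(x))
```

Under these versions of the densities, the two printed columns agree up to floating-point rounding for every $x$.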

While the difficulty with the representation (1) is usually addressed in terms of computational aspects, 
given that $\pi_1(\theta|x)$ is rarely available in closed form, we argue in the current paper that the Savage-Dickey 
representation faces challenges of a deeper nature that led us to consider it a 'paradox'. First, by considering 
both prior and posterior marginal distributions of $\theta$ uniquely under the alternative model, (1) seems to 
indicate that the posterior probability of the null hypothesis $H_0: \theta = \theta_0$ is contained within the 
alternative hypothesis posterior distribution, even though the set of $(\theta,\psi)$'s such that $\theta = \theta_0$ 
has a zero probability under this alternative distribution. Second, as explained in Section 2, an even more 
fundamental difficulty with assumption (2) is that it is meaningless when examined (as it should be) within the 
mathematical axioms of measure theory. 

Having stated those mathematical difficulties with the Savage-Dickey representation, we proceed to show 
in Section 3 that similar identities hold under no constraint on the prior distributions. In Section 3, we 
derive computational algorithms that exploit these representations to approximate the Bayes factor, in an 
approach that differs from the earlier solution of Verdinelli and Wasserman (1995). The paper concludes 
with an illustration in the setting of variable selection within a probit model. 

2. A measure-theoretic paradox 

When considering a standard probabilistic setting where the dominating measure on the parameter space 
is the Lebesgue measure, rather than a counting measure, the conditional density $\pi_1(\psi|\theta)$ is rigorously 
(Billingsley, 1986) defined as the density of the conditional probability distribution or, equivalently, by the 
condition that 

$$\mathbb{P}\left((\theta,\psi) \in A_1 \times A_2\right) = \int_{A_1}\int_{A_2} \pi_1(\psi|\theta)\,d\psi\,\pi_1(\theta)\,d\theta = \int_{A_1 \times A_2} \pi_1(\theta,\psi)\,d\psi\,d\theta\,,$$

for all measurable sets $A_1 \times A_2$, when $\pi_1(\theta)$ is the associated marginal density of $\theta$. Therefore, 
this identity points out the well-known fact that the conditional density function $\pi_1(\psi|\theta)$ is defined up 
to a set of measure zero both in $\psi$ for every value of $\theta$ and in $\theta$. This implies that changing 
arbitrarily the value of the function $\pi_1(\cdot|\theta)$ for a negligible collection of values of $\theta$ does not 
impact the properties of the conditional distribution. 

In the setting where the Savage-Dickey representation is advocated, the value $\theta_0$ to be tested is not 
determined from the observations but it is instead given in advance since this is a testing problem. Therefore 
the density function 

$$\pi_1(\psi|\theta_0)$$

may be chosen in a completely arbitrary manner and there is no possible reason for a unique representation 
of $\pi_1(\psi|\theta_0)$ that can be found within measure theory. This implies that there always is a version of the 
conditional density $\pi_1(\psi|\theta_0)$ such that Dickey's (1971) condition (2) is satisfied and, conversely, 
that there are an infinity of versions for which it is not satisfied. As a result, from a mathematical perspective, 
condition (2) cannot be seen as an assumption on the prior $\pi_1$ without further conditions, contrary to what is 
stated in the original Dickey (1971) and later in O'Hagan and Forster (2004), Consonni and Veronese (2008) 
and Wetzels et al. (2010). This difficulty is the first part of what we call the Savage-Dickey paradox, namely 
that, as stated, the representation (1) relies on a mathematically void constraint on the prior distribution. 
In the specific case of the artificial example introduced above, the choice of the conditional density 
$\pi_1(\psi|\theta_0)$ is therefore arbitrary: if we pick for this density the density of the $\mathcal{N}(0,1)$ 
distribution, there is agreement between $\pi_1(\psi|\theta_0)$ and $\pi_0(\psi)$, while, if we select instead the 
function $\exp(+\psi^2/2)$, which is not a density, there is no agreement in the sense of condition (2). The paradox 
is that this disagreement has no consequence whatsoever in the Savage-Dickey representation. 

The second part of the Savage-Dickey paradox is that the representation (1) is solely valid for a specific 
and unique choice of a version of the density for both the conditional density $\pi_1(\psi|\theta_0)$ and the joint 
density $\pi_1(\theta_0,\psi)$. When looking at the derivation of (1), the choices of some specific versions of those 
densities are indeed noteworthy: in the following development, 

$$\begin{aligned}
B_{01}(x) &= \frac{\int \pi_0(\psi)\,f(x|\theta_0,\psi)\,d\psi}{\int \pi_1(\theta,\psi)\,f(x|\theta,\psi)\,d\psi\,d\theta} && \text{[by definition]} \\
&= \frac{\int \pi_1(\psi|\theta_0)\,f(x|\theta_0,\psi)\,d\psi\;\pi_1(\theta_0)}{\int \pi_1(\theta,\psi)\,f(x|\theta,\psi)\,d\psi\,d\theta\;\pi_1(\theta_0)} && \text{[using a specific version of } \pi_1(\psi|\theta_0)\text{]} \\
&= \frac{\int \pi_1(\theta_0,\psi)\,f(x|\theta_0,\psi)\,d\psi}{m_1(x)\,\pi_1(\theta_0)} && \text{[using a specific version of } \pi_1(\theta_0,\psi)\text{]} \\
&= \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)} && \text{[using a specific version of } \pi_1(\theta_0|x)\text{]}
\end{aligned}$$

the second equality depends on a specific choice of the version of $\pi_1(\psi|\theta_0)$ but not on the choice of the 
version of $\pi_1(\theta_0)$, while the third equality depends on a specific choice of the version of 
$\pi_1(\theta_0,\psi)$ as equal to $\pi_0(\psi)\pi_1(\theta_0)$, thus related to the choice of the version of 
$\pi_1(\theta_0)$. The last equality leading to the Savage-Dickey representation relies on the choice of a specific 
version of $\pi_1(\theta_0|x)$ as well, namely that the constraint 

$$\frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)} = \frac{\int \pi_0(\psi)\,f(x|\theta_0,\psi)\,d\psi}{m_1(x)}$$

holds, where the right hand side is equal to the Bayes factor $B_{01}(x)$ and is therefore independent from 
the version. This rigorous analysis implies that the Savage-Dickey representation is tautological, due to the 
availability of a version of the posterior density that makes it hold. 

As an illustration, consider once again the artificial example above. As already stressed, the value to be 
tested $\theta_0 = 1$ is set prior to the experiment. Thus, without modifying either the prior distribution under 
model $\mathfrak{M}_1$ or the marginal posterior distribution of the parameter $\theta$ under model $\mathfrak{M}_1$, 
and in a completely rigorous measure-theoretic framework, we can select 

$$\pi_1(\theta_0) = 100 = \pi_1(\theta_0|x)\,.$$

For that choice, we obtain 

$$\pi_1(\theta_0|x)\big/\pi_1(\theta_0) = 1 \neq B_{01}(x) = \Gamma(3/2)^{-1}\,(1 + x^2/4)^{3/2}\exp(-x^2/4)\,.$$

Hence, for this specific choice of the densities, the Savage-Dickey representation does not hold. 
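
A minimal Python sketch of this counter-example follows. It assumes that the alternative version of each density is obtained by resetting its value to 100 at the single point $\theta_0 = 1$, a set of Lebesgue measure zero, and leaving it unchanged elsewhere; the function names are hypothetical.

```python
import math

THETA0 = 1.0

def prior_std(theta):
    """Continuous version of the IG(1, 1) prior density pi_1(theta)."""
    return theta ** (-2) * math.exp(-1 / theta) if theta > 0 else 0.0

def post_std(theta, x):
    """Continuous version of the marginal posterior pi_1(theta | x)."""
    if theta <= 0:
        return 0.0
    b = 1 + x**2 / 4
    return b**1.5 / math.gamma(1.5) * theta ** (-2.5) * math.exp(-b / theta)

def prior_alt(theta):
    """Another version of the same prior: modified only on the null set {theta0}."""
    return 100.0 if theta == THETA0 else prior_std(theta)

def post_alt(theta, x):
    """Another version of the same posterior: modified only on {theta0}."""
    return 100.0 if theta == THETA0 else post_std(theta, x)

x = 0.5
b01 = math.exp(-x**2 / 4) * (1 + x**2 / 4) ** 1.5 / math.gamma(1.5)
ratio_std = post_std(THETA0, x) / prior_std(THETA0)   # coincides with B_01(x)
ratio_alt = post_alt(THETA0, x) / prior_alt(THETA0)   # equals 1 whatever x is
print(b01, ratio_std, ratio_alt)
```

Both pairs of functions define the same prior and posterior distributions, yet only the first pair returns the Bayes factor when evaluated at $\theta_0$.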



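Verdinelli and Wasserman (1995) have proposed a generalisation of the Savage-Dickey density ratio when 
the constraint (2) on the prior densities is not verified (we stress again that this is a mathematically void 
constraint on the respective prior distributions). Verdinelli and Wasserman (1995) state that 

$$\begin{aligned}
B_{01}(x) &= \frac{\int \pi_0(\psi)\,f(x|\theta_0,\psi)\,d\psi}{m_1(x)} && \text{[by definition]} \\
&= \pi_1(\theta_0|x)\,\frac{\int \pi_0(\psi)\,f(x|\theta_0,\psi)\,d\psi}{m_1(x)\,\pi_1(\theta_0|x)} && \text{[for any version of } \pi_1(\theta_0|x)\text{]} \\
&= \pi_1(\theta_0|x) \int \frac{\pi_0(\psi)\,f(x|\theta_0,\psi)}{m_1(x)\,\pi_1(\theta_0|x)}\,\frac{\pi_1(\psi|\theta_0)}{\pi_1(\psi|\theta_0)}\,d\psi && \text{[for any version of } \pi_1(\psi|\theta_0)\text{]} \\
&= \pi_1(\theta_0|x) \int \frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\,\frac{f(x|\theta_0,\psi)\,\pi_1(\psi|\theta_0)}{m_1(x)\,\pi_1(\theta_0|x)}\,\frac{\pi_1(\theta_0)}{\pi_1(\theta_0)}\,d\psi && \text{[for any version of } \pi_1(\theta_0)\text{]} \\
&= \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\,\mathbb{E}^{\pi_1(\psi|x,\theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\right] && \text{[for a specific version of } \pi_1(\psi|\theta_0,x)\text{]}
\end{aligned}$$

This representation of Verdinelli and Wasserman (1995) therefore remains valid for any choice of versions 
for $\pi_1(\theta_0|x)$, $\pi_1(\theta_0)$, $\pi_1(\psi|\theta_0)$, provided the conditional density 
$\pi_1(\psi|\theta_0,x)$ is defined by 

$$\pi_1(\psi|\theta_0,x) = \frac{f(x|\theta_0,\psi)\,\pi_1(\psi|\theta_0)\,\pi_1(\theta_0)}{m_1(x)\,\pi_1(\theta_0|x)}\,,$$

which obviously means that the Verdinelli-Wasserman representation 

$$B_{01}(x) = \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\,\mathbb{E}^{\pi_1(\psi|x,\theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\right] \tag{3}$$

is dependent on the choice of a version of $\pi_1(\theta_0)$. 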



We now establish that an alternative representation of the Bayes factor is available and can be exploited 
for approximation purposes. When considering the Bayes factor 

$$B_{01}(x) = \frac{\int \pi_0(\psi)\,f(x|\theta_0,\psi)\,d\psi}{\int \pi_1(\theta,\psi)\,f(x|\theta,\psi)\,d\psi\,d\theta}\;\frac{\pi_1(\theta_0)}{\pi_1(\theta_0)}\,,$$

where the right hand side obviously is independent of the choice of the version of $\pi_1(\theta_0)$, the numerator can 
be seen as involving a specific version in $\theta = \theta_0$ of the marginal posterior density 

$$\tilde\pi_1(\theta|x) \propto \int \pi_0(\psi)\,f(x|\theta,\psi)\,d\psi\;\pi_1(\theta)\,,$$

which is associated with the alternative prior $\tilde\pi_1(\theta,\psi) = \pi_1(\theta)\,\pi_0(\psi)$. Indeed, this 
density $\tilde\pi_1(\theta|x)$ appears as the marginal posterior density of the posterior distribution defined by the 
density 

$$\tilde\pi_1(\theta,\psi|x) = \frac{\pi_0(\psi)\,\pi_1(\theta)\,f(x|\theta,\psi)}{\tilde m_1(x)}\,,$$

where $\tilde m_1(x)$ is the proper normalising constant of the joint posterior density. In order to guarantee a 
Savage-Dickey-like representation of the Bayes factor, the appropriate version of the marginal posterior density in 
$\theta = \theta_0$, $\tilde\pi_1(\theta_0|x)$, is obtained by imposing 

$$\frac{\tilde\pi_1(\theta_0|x)}{\pi_1(\theta_0)} = \frac{\int \pi_0(\psi)\,f(x|\theta_0,\psi)\,d\psi}{\tilde m_1(x)}\,, \tag{4}$$

where, once again, the right hand side of the equation is uniquely defined. This constraint amounts to 
imposing that Bayes' theorem holds in $\theta = \theta_0$ instead of almost everywhere (and thus not necessarily in 
$\theta = \theta_0$). It then leads to the alternative representation 

$$B_{01}(x) = \frac{\tilde\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\;\frac{\tilde m_1(x)}{m_1(x)}\,,$$

which holds for any value chosen for $\pi_1(\theta_0)$ provided condition (4) applies. 

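This new representation may seem to be only formal, since both $m_1(x)$ and $\tilde m_1(x)$ are usually unavailable 
in closed form, but we can take advantage of the fact that the bridge sampling identity of Torrie and Valleau 
(1977) (see also Gelman and Meng, 1998) gives an unbiased estimator of $\tilde m_1(x)/m_1(x)$ since 

$$\mathbb{E}^{\pi_1(\theta,\psi|x)}\left[\frac{\pi_0(\psi)\,\pi_1(\theta)\,f(x|\theta,\psi)}{\pi_1(\theta,\psi)\,f(x|\theta,\psi)}\right] = \mathbb{E}^{\pi_1(\theta,\psi|x)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta)}\right] = \frac{\tilde m_1(x)}{m_1(x)}\,.$$

In conclusion, we obtain the representation 

$$B_{01}(x) = \frac{\tilde\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\,\mathbb{E}^{\pi_1(\theta,\psi|x)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta)}\right]\,, \tag{5}$$
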

whose expectation part is uniquely defined (in that it does not depend on the choice of a version of the 
densities involved therein), while the first ratio must satisfy condition (4). We further note that this 
representation clearly differs from Verdinelli and Wasserman's (1995) representation: 

$$B_{01}(x) = \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\,\mathbb{E}^{\pi_1(\psi|x,\theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\right]\,, \tag{6}$$

since (6) uses a specific version of the marginal posterior density on $\theta$ in $\theta_0$, as well as a specific 
version of the full conditional posterior density of $\psi$ given $\theta_0$. 
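
To make representation (5) concrete, the following Python sketch applies it to the artificial example of Section 1. Several choices are ours and should be read as assumptions: the value $x = 0$ (by our own calculation the weights $\pi_0(\psi)/\pi_1(\psi|\theta)$ have finite variance there, which need not be the case for other values of $x$), exact draws from $\pi_1(\theta,\psi|x)$ obtained through $\theta|x \sim \mathcal{IG}(3/2, 1 + x^2/4)$ and $\psi|\theta,x \sim \mathcal{N}(x/2, \theta/2)$, and a trapezoidal quadrature for $\tilde m_1(x)$. In this example $\int \pi_0(\psi) f(x|\theta_0,\psi)\,d\psi = m_0(x)$, so the version of the first ratio imposed by (4) is $m_0(x)/\tilde m_1(x)$. All function names are hypothetical.

```python
import math
import random

X = 0.0         # observed value; x = 0 is an assumption of this sketch
T = 100_000     # number of Monte Carlo draws (arbitrary choice)

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def m0(x):
    # Marginal under the null model: x ~ N(0, 2).
    return normal_pdf(x, 0.0, 2.0)

def m1(x):
    # Closed-form marginal under the alternative model, as given in Section 1.
    return (1 + x * x / 4) ** (-1.5) * math.gamma(1.5) / math.sqrt(4 * math.pi)

def m1_tilde(x, upper=6.0, steps=6000):
    # Normalising constant under the prior pi_1(theta) * pi_0(psi), i.e. the
    # integral of N(x; 0, 1 + theta) * theta^-2 * exp(-1/theta) over theta,
    # rewritten with theta = 1 / v^2 and computed by the trapezoidal rule.
    h = upper / steps
    total = 0.0
    for i in range(1, steps + 1):          # the integrand vanishes at v = 0
        v = i * h
        w = 0.5 if i == steps else 1.0
        total += w * 2 * v * math.exp(-v * v) * normal_pdf(x, 0.0, 1 + 1 / (v * v))
    return total * h

def bridge_ratio(x, n_draws, rng):
    # Average of pi_0(psi) / pi_1(psi | theta) over exact draws from pi_1(theta, psi | x).
    total = 0.0
    for _ in range(n_draws):
        theta = (1 + x * x / 4) / rng.gammavariate(1.5, 1.0)   # IG(3/2, 1 + x^2/4)
        psi = rng.gauss(x / 2, math.sqrt(theta / 2))           # N(x/2, theta/2)
        total += math.sqrt(theta) * math.exp(0.5 * psi * psi * (1 / theta - 1))
    return total / n_draws

rng = random.Random(2010)
ratio_hat = bridge_ratio(X, T, rng)     # Monte Carlo estimate of m1_tilde / m1
first_ratio = m0(X) / m1_tilde(X)       # version of the first ratio imposed by (4)
print("estimate of B01:", first_ratio * ratio_hat)
print("exact B01      :", m0(X) / m1(X))
```

The two printed values should be close, the gap reflecting Monte Carlo error only; in realistic problems $\tilde m_1(x)$ is not available by quadrature and the first ratio has to be estimated by simulation as well.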



3. Computational solutions 

In this Section, we consider the computational implications of the above representation in the specific case 
of latent variable models, namely under the practical possibility of a data completion by a latent variable $z$ 
such that 

$$f(x|\theta,\psi) = \int f(x|\theta,\psi,z)\,f(z|\theta,\psi)\,dz$$

when $\pi_1(\theta|x,\psi,z) \propto \pi_1(\theta)\,f(x|\theta,\psi,z)$ is available in closed form, including the 
normalising constant. 

We first consider a computational solution that approximates the Bayes factor based on our novel 
representation (5). Given a sample $(\bar\theta^{(1)},\bar\psi^{(1)},\bar z^{(1)}),\ldots,(\bar\theta^{(T)},\bar\psi^{(T)},\bar z^{(T)})$ 
simulated from (or converging to) the augmented posterior distribution $\tilde\pi_1(\theta,\psi,z|x)$, the sequence 

$$\frac{1}{T}\sum_{t=1}^{T} \tilde\pi_1(\theta_0|x,\bar z^{(t)},\bar\psi^{(t)})$$

converges to $\tilde\pi_1(\theta_0|x)$ in $T$ under the following constraint on the selected version of 
$\tilde\pi_1(\theta_0|x,z,\psi)$ used therein: 

$$\frac{\tilde\pi_1(\theta_0|x,z,\psi)}{\pi_1(\theta_0)} = \frac{f(x,z|\theta_0,\psi)}{\int f(x,z|\theta,\psi)\,\pi_1(\theta)\,d\theta}\,,$$

which again amounts to imposing that Bayes' theorem holds in $\theta = \theta_0$ for $\tilde\pi_1(\theta|x,z,\psi)$ 
rather than almost everywhere. (Note once more that the right hand side is uniquely defined, i.e. that it does not 
depend on a specific version.) Therefore, provided iid or MCMC simulations from the joint target 
$\tilde\pi_1(\theta,\psi,z|x)$ are available, the converging approximation to the Bayes factor $B_{01}(x)$ is then 

$$\frac{1}{T}\sum_{t=1}^{T} \frac{\tilde\pi_1(\theta_0|x,\bar z^{(t)},\bar\psi^{(t)})}{\pi_1(\theta_0)}\;\frac{\tilde m_1(x)}{m_1(x)}\,.$$




(We stress that the simulated sample is produced for the artificial target $\tilde\pi_1(\theta,\psi,z|x)$ rather than 
the true posterior $\pi_1(\theta,\psi,z|x)$ if $\tilde\pi_1(\theta,\psi) \neq \pi_1(\theta,\psi)$.) Moreover, if 
$(\theta^{(1)},\psi^{(1)}),\ldots,(\theta^{(T)},\psi^{(T)})$ is a sample independently simulated from (or converging 
to) $\pi_1(\theta,\psi|x)$, then 

$$\frac{1}{T}\sum_{t=1}^{T} \frac{\pi_0(\psi^{(t)})}{\pi_1(\psi^{(t)}|\theta^{(t)})}$$

is a convergent and unbiased estimator of $\tilde m_1(x)/m_1(x)$. Therefore, the computational solution associated 
with our representation (5) of $B_{01}(x)$ leads to the following unbiased estimator of the Bayes factor: 

$$\widehat{B}_{01}^{MR}(x) = \frac{1}{T}\sum_{t=1}^{T} \frac{\tilde\pi_1(\theta_0|x,\bar z^{(t)},\bar\psi^{(t)})}{\pi_1(\theta_0)}\;\frac{1}{T}\sum_{t=1}^{T} \frac{\pi_0(\psi^{(t)})}{\pi_1(\psi^{(t)}|\theta^{(t)})}\,. \tag{7}$$

Note that 

$$\mathbb{E}^{\tilde\pi_1(\theta,\psi|x)}\left[\frac{\pi_1(\theta,\psi)\,f(x|\theta,\psi)}{\pi_0(\psi)\,\pi_1(\theta)\,f(x|\theta,\psi)}\right] = \frac{m_1(x)}{\tilde m_1(x)}$$

implies that 

$$\left[\frac{1}{T}\sum_{t=1}^{T} \frac{\pi_1(\bar\psi^{(t)}|\bar\theta^{(t)})}{\pi_0(\bar\psi^{(t)})}\right]^{-1}$$

is another convergent (if biased) estimator of $\tilde m_1(x)/m_1(x)$. The availability of two estimates of the ratio 
$\tilde m_1(x)/m_1(x)$ is a major bonus from a computational point of view since the comparison of both estimators 
may allow for the detection of infinite variance estimators, as well as for a check of the coherence of the 
approximations. The first approach requires two simulation sequences, one from $\tilde\pi_1(\theta,\psi|x)$ and one 
from $\pi_1(\theta,\psi|x)$, but this is a void constraint in that, if $H_0$ is rejected, a sample from the alternative 
hypothesis posterior will be required no matter what. Although we do not pursue this possibility in the current 
paper, note that a comparison of the different representations (including Verdinelli and Wasserman's, 1995, as 
exposed below) could be conducted by expressing them in the bridge sampling formalism (Gelman and Meng, 1998). 
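
Once the density evaluations are available, estimator (7) and the two estimators of $\tilde m_1(x)/m_1(x)$ reduce to simple averages. The Python sketch below assumes that these evaluations have already been computed from the two simulation outputs and are passed as plain lists; the function names and the relative-gap diagnostic are hypothetical conveniences of ours, not part of the paper or of its R code.

```python
from statistics import fmean

def normalising_ratio_direct(prior_ratios):
    """Estimate of tilde-m_1 / m_1 as the mean of pi_0(psi_t) / pi_1(psi_t | theta_t),
    with (theta_t, psi_t) drawn from pi_1(theta, psi | x)."""
    return fmean(prior_ratios)

def normalising_ratio_reciprocal(inverse_ratios):
    """Estimate of tilde-m_1 / m_1 as the reciprocal of the mean of
    pi_1(psi_t | theta_t) / pi_0(psi_t), with draws from tilde-pi_1(theta, psi | x)."""
    return 1.0 / fmean(inverse_ratios)

def bayes_factor_mr(cond_post_at_theta0, prior_at_theta0, prior_ratios):
    """Estimator (7): average conditional posterior density at theta_0 (draws from
    the tilde target), divided by pi_1(theta_0), times the direct ratio estimate."""
    return fmean(cond_post_at_theta0) / prior_at_theta0 * normalising_ratio_direct(prior_ratios)

def coherence_gap(direct, reciprocal):
    """Relative gap between the two ratio estimates; a large value is a warning sign."""
    return abs(direct - reciprocal) / max(abs(direct), abs(reciprocal))
```

A large relative gap between the two ratio estimates does not identify which one is at fault, but it does suggest inspecting the weights for a few dominating terms, the usual symptom of an infinite-variance estimator.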



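We now consider a computational solution that approximates the Bayes factor and is based on Verdinelli 
and Wasserman's (1995) representation (6). Given a sample 
$(\theta^{(1)},\psi^{(1)},z^{(1)}),\ldots,(\theta^{(T)},\psi^{(T)},z^{(T)})$ simulated from (or converging to) 
$\pi_1(\theta,\psi,z|x)$, the sequence 

$$\frac{1}{T}\sum_{t=1}^{T} \pi_1(\theta_0|x,z^{(t)},\psi^{(t)})$$

converges to $\pi_1(\theta_0|x)$ under the following constraint on the selected version of 
$\pi_1(\theta_0|x,z,\psi)$ used there: 

$$\frac{\pi_1(\theta_0|x,z,\psi)}{\pi_1(\theta_0)} = \frac{f(x,z|\theta_0,\psi)}{\int f(x,z|\theta,\psi)\,\pi_1(\theta)\,d\theta}\,.$$

Moreover, if $(\check\psi^{(1)},\check z^{(1)}),\ldots,(\check\psi^{(T)},\check z^{(T)})$ is a sample generated from 
(or converging to) $\pi_1(\psi,z|x,\theta_0)$, the sequence 

$$\frac{1}{T}\sum_{t=1}^{T} \frac{\pi_0(\check\psi^{(t)})}{\pi_1(\check\psi^{(t)}|\theta_0)}$$

is converging to 

$$\mathbb{E}^{\pi_1(\psi|x,\theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\right]$$

under the constraint 

$$\pi_1(\psi,z|\theta_0,x) \propto f(x,z|\theta_0,\psi)\,\pi_1(\psi|\theta_0)\,.$$

Therefore, the computational solution associated with Verdinelli and Wasserman's (1995) representation (6) 
of $B_{01}(x)$ leads to the following unbiased estimator of the Bayes factor: 

$$\widehat{B}_{01}^{VW}(x) = \frac{1}{T}\sum_{t=1}^{T} \frac{\pi_1(\theta_0|x,z^{(t)},\psi^{(t)})}{\pi_1(\theta_0)}\;\frac{1}{T}\sum_{t=1}^{T} \frac{\pi_0(\check\psi^{(t)})}{\pi_1(\check\psi^{(t)}|\theta_0)}\,. \tag{8}$$

Although, at first sight, the approximations (7) and (8) may look very similar, the simulated sequences used 
in both approximations differ: the first average involves simulations from $\tilde\pi_1(\theta,\psi,z|x)$ and from 
$\pi_1(\theta,\psi,z|x)$, respectively, while the second average relies on simulations from $\pi_1(\theta,\psi,z|x)$ 
and from $\pi_1(\psi,z|x,\theta_0)$, respectively. 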

4. An illustration 

Although our purpose in this note is far from advancing the superiority of the Savage-Dickey type 
representations for Bayes factor approximation, given the wealth of available solutions for embedded models (Chen 
et al., 2000, Marin and Robert, 2010), we briefly consider an example where both Verdinelli and Wasserman's 
(1995) and our proposal apply. The model is the Bayesian posterior distribution of the regression coefficients 
of a probit model, following the prior modelling adopted in Marin and Robert (2007) that extends Zellner's 
(1971) g-prior to generalised linear models. We take as data the Pima Indian diabetes study, available as a 
dataset in R (R Development Core Team, 2008) with 332 women registered, and build a probit model predicting 
the presence of diabetes from three predictors, the glucose concentration, the diastolic blood pressure and 
the diabetes pedigree function, assessing the impact of the diabetes pedigree function, i.e. testing the nullity 
of the coefficient associated with this variable. For more details on the statistical and computational issues, 
see Marin and Robert (2010) since that paper relies on the Pima Indian probit model as benchmark. 

This probit model is a natural setting for completion by a truncated normal latent variable (Albert and 
Chib, 1993). We can thus easily implement a Gibbs sampler to produce output from all the posterior 
distributions considered in the previous Section. Besides, in that case, the conditional distribution 
$\pi_1(\theta|x,\psi,z)$ is a normal distribution with closed form parameters. It is therefore straightforward to 
compute the unbiased estimators (7) and (8). Figure 1 compares the variation of this approximation with other 
standard solutions covered in Marin and Robert (2010) for the same example, namely the regular importance sampling 
approximation based on the MLE asymptotic distribution, Chib's version based on the same completion, 
and a bridge sampling (Gelman and Meng, 1998) solution completing $\pi_0(\cdot)$ with the full conditional being 
derived from the conditional MLE asymptotic distribution. The boxplots are all based on 100 replicates of 
$T = 20,000$ simulations. While the estimators (7) and (8) are not as accurate as Chib's version and as the 
importance sampler in this specific case, their variabilities remain at a reasonable order and are very 
comparable. The R code and the reformatted datasets used in this Section are available at the following address: 
http://www.math.univ-montp2.fr/~marin/savage/dickey.html. 
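
For readers unfamiliar with the data completion of Albert and Chib (1993), here is a minimal Python sketch of the corresponding Gibbs sampler. It is deliberately simplified and rests on assumptions of ours: synthetic data rather than the Pima Indian dataset, a single covariate without intercept, and a prior $\beta \sim \mathcal{N}(0, g/\sum_i x_i^2)$ with $g = n$ standing in for the g-prior extension of Marin and Robert (2007). It does not reproduce the authors' R code, and all names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def truncated_normal(mean, positive, rng):
    # Draw z ~ N(mean, 1) restricted to z > 0 (positive=True) or z <= 0, by CDF inversion.
    p_neg = STD_NORMAL.cdf(-mean)                  # P(z <= 0)
    u = rng.random()
    p = p_neg + u * (1 - p_neg) if positive else u * p_neg
    # Guard: inv_cdf needs 0 < p < 1; in extreme tails this clipping is only approximate.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(p)

def gibbs_probit(x, y, n_iter, rng, g=None):
    # Gibbs sampler for P(y_i = 1) = Phi(beta * x_i) with latent z_i ~ N(beta * x_i, 1).
    n = len(x)
    g = float(n) if g is None else g               # g = n is an assumption of this sketch
    sxx = sum(xi * xi for xi in x)
    shrink = g / (g + 1.0)
    beta, draws = 0.0, []
    for _ in range(n_iter):
        # 1. Latent variables given beta: truncated normals, sign fixed by y_i.
        z = [truncated_normal(beta * xi, yi == 1, rng) for xi, yi in zip(x, y)]
        # 2. beta given z: normal with closed-form mean and variance.
        mean = shrink * sum(xi * zi for xi, zi in zip(x, z)) / sxx
        beta = rng.gauss(mean, math.sqrt(shrink / sxx))
        draws.append(beta)
    return draws

rng = random.Random(42)
beta_true = 1.0                                    # used only to simulate the data
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [1 if rng.gauss(beta_true * xi, 1.0) > 0 else 0 for xi in x]
draws = gibbs_probit(x, y, n_iter=600, rng=rng)
kept = draws[100:]                                 # discard a short burn-in
post_mean = sum(kept) / len(kept)
print("posterior mean of beta:", post_mean)
```

The closed-form normal conditional in step 2 is what makes quantities such as $\pi_1(\theta_0|x,z,\psi)$ directly computable along the chain, which is the property exploited by the estimators (7) and (8).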

[Figure 1: boxplots of the five approximations, labelled Bridge, MR, VW, Chib and IS.] 

Fig 1. Comparison of the variabilities of five approximations of the Bayes factor evaluating the impact of the diabetes pedigree 
covariate upon the occurrence of diabetes in the Pima Indian population, based on a probit modelling. The boxplots are based 
on 100 replicas and the Savage-Dickey representation proposed in the current paper is denoted by MR, while Verdinelli and 
Wasserman's (1995) version is denoted by VW. 